What could we do if Latent Profile Analysis (LPA) does not run as expected? – LPA of Chicago Neighborhoods

Tags: Biostatistics, Psychology/Sociology, Factor Analysis, Mclust, Mplus, Tutorial, Mapping

American Community Survey data - Chicago neighborhoods;
Latent Profile Analysis: Gaussian Mixture modeling;
Mclust package;
Mplus;
Exploratory Factor Analysis;

Hai Nguyen
October 24, 2021
library(mclust)     # Gaussian finite mixture models (LPA)
library(tidyverse)  # data wrangling and plotting
library(nFactors)   # scree test and parallel analysis

Story

Chicago is famous for its diverse neighborhoods. That is one thing; but how can we measure each neighborhood area based on its socioeconomic status?

This write-up (with step-by-step code and explanations) walks us through clustering the neighborhoods with a modern method: Latent Profile Analysis. However, the method comes with some barriers (model selection criteria, homogeneity and separation, the normality assumption) that we need another technique to overcome.

Outline:

1- Prepare ACS data
2- What is LPA
3- Barriers of LPA and the steps to solve them
4- Mapping
5- LPA with covariate(s)

Prepare data

The American Community Survey

The American Community Survey (ACS) is a U.S. Census Bureau data product that publishes estimates and their margins of error for all geographies covered by the survey.

The data cover demographic, social, economic, and housing subject areas, released as Detailed Tables for the ACS 1-year or 5-year estimates. Here we deal with the ACS 5-year estimates, which are published for all geographic areas, including census tracts, block groups, American Indian areas, core-based statistical areas, combined statistical areas, Congressional districts, and state legislative districts.

ACS website

ACS data

Here is a summary of the dataset "Chicago Census Data by Tract_American Community Survey (ACS)". It contains census tract characteristics from the 2010-2014 5-year ACS estimates, which are easy to download from https://data.census.gov/cedsci/.
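
As an aside, tract-level ACS tables can also be pulled straight into R. Below is a minimal sketch with the tidycensus package, assuming you have a Census API key; the variable code shown (B19013_001, median household income) is illustrative and is not the exact extract behind this dataset.

library(tidycensus)
# one-time setup (assumption: you have requested a free Census API key)
# census_api_key("YOUR_KEY", install = TRUE)
# B19013_001 = median household income (illustrative variable only)
chi.acs <- get_acs(geography = "tract",
                   variables = c(MedianIncomeHH = "B19013_001"),
                   state = "IL", county = "Cook",
                   year = 2014, survey = "acs5")  # 2010-2014 5-year estimates
head(chi.acs)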

We will use the following \(11\) variables in the exploratory factor analysis:

Variable names and labels:
* AgeDependencyRatio: age dependency ratio
* LimitedEngProf5andUP: limited English proficiency (ages 5 and up)
* LessThanHS: less than a high school education
* Unemployment: unemployment rate
* PctForeignBorn: percentage foreign-born
* FemaleHHPct: percentage of female-headed households
* MedianIncomeHH: median household income
* AllOcc2009Pct: percentage of all occupants in 2009
* PctVacHousing: percentage of vacant housing
* PctBelowPoverty: percentage below poverty
* PublicAssistancePct: percentage receiving public assistance
census <- read.csv("data/Chicago Census Data by Tract_ACS 2015.csv", header = TRUE) %>% 
  select(tract, AgeDependencyRatio, LimitedEngProf5andUP, LessThanHS, Unemployment, 
         PctForeignBorn, FemaleHHPct, MedianIncomeHH, AllOcc2009Pct, PctVacHousing, 
         PctBelowPoverty, PublicAssistancePct)
# recode into % to scale to other variables
df <- census %>% mutate(
  PctForeignBorn = PctForeignBorn*100,
  FemaleHHPct = FemaleHHPct*100,
  # MedianIncomeHH1Kinv = (max(MedianIncomeHH, na.rm=TRUE) - MedianIncomeHH)/1000,
  # we would not use 1Kinv because of weird distribution (multi-mode)
  MedianIncomeHH1K = MedianIncomeHH/1000,
  AllOcc2009Pct = AllOcc2009Pct*100,
  PctVacHousing = PctVacHousing*100,
  PublicAssistancePct = PublicAssistancePct*100
) %>% 
  select(tract, AgeDependencyRatio, LimitedEngProf5andUP, LessThanHS, Unemployment, 
         PctForeignBorn, FemaleHHPct, MedianIncomeHH1K, AllOcc2009Pct, PctVacHousing, 
         PctBelowPoverty, PublicAssistancePct)
df2 <- df.fa <- df[complete.cases(df),]
summary(df2)
     tract        AgeDependencyRatio LimitedEngProf5andUP
 Min.   : 10100   Min.   :  4.8      Min.   : 0.00       
 1st Qu.:160800   1st Qu.: 39.8      1st Qu.: 1.90       
 Median :351100   Median : 55.0      Median : 8.40       
 Mean   :403974   Mean   : 52.9      Mean   :13.99       
 3rd Qu.:670300   3rd Qu.: 65.3      3rd Qu.:24.00       
 Max.   :843900   Max.   :147.4      Max.   :64.00       
   LessThanHS     Unemployment   PctForeignBorn    FemaleHHPct    
 Min.   : 0.00   Min.   : 0.00   Min.   : 0.000   Min.   : 0.000  
 1st Qu.: 8.20   1st Qu.: 7.00   1st Qu.: 3.858   1st Qu.: 8.704  
 Median :16.30   Median :11.90   Median :15.115   Median :16.502  
 Mean   :18.39   Mean   :14.52   Mean   :18.659   Mean   :20.122  
 3rd Qu.:26.20   3rd Qu.:19.80   3rd Qu.:31.076   3rd Qu.:29.859  
 Max.   :62.50   Max.   :91.90   Max.   :71.263   Max.   :70.470  
 MedianIncomeHH1K AllOcc2009Pct   PctVacHousing    PctBelowPoverty
 Min.   :  5.00   Min.   : 0.00   Min.   : 0.000   Min.   : 0.40  
 1st Qu.: 29.40   1st Qu.:53.45   1st Qu.: 7.752   1st Qu.:12.50  
 Median : 42.32   Median :62.85   Median :11.722   Median :21.80  
 Mean   : 49.66   Mean   :62.85   Mean   :13.898   Mean   :24.07  
 3rd Qu.: 62.06   3rd Qu.:72.80   3rd Qu.:18.066   3rd Qu.:33.80  
 Max.   :160.46   Max.   :94.06   Max.   :50.673   Max.   :74.20  
 PublicAssistancePct
 Min.   : 0.000     
 1st Qu.: 9.252     
 Median :22.002     
 Mean   :24.440     
 3rd Qu.:36.667     
 Max.   :79.851     

Correlation matrix

cortab <- round(Hmisc::rcorr(as.matrix(df2[,-1]))[[1]],2)
rmarkdown::paged_table(as.data.frame(cortab), options = list(rows.print = 11, cols.print = 11))
corrplot::corrplot(cor(df2[,-1]))

As we can see, some variables are strongly associated with one another (Unemployment, FemaleHHPct, PctBelowPoverty, PublicAssistancePct), while others are not.

Introduction to Latent Profile Analysis
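
Before running it, a one-paragraph refresher: latent profile analysis is the continuous-indicator counterpart of latent class analysis. It assumes the observed indicator vector is drawn from a finite mixture of multivariate Gaussian components, each component corresponding to a latent profile. In standard mixture notation (not specific to this post),

\[
f(\mathbf{x}) = \sum_{k=1}^{G} \pi_k \, \phi(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad \pi_k > 0, \quad \sum_{k=1}^{G} \pi_k = 1,
\]

where \(\phi\) is the multivariate normal density and the mclust model names (EII, VVV, VVE, ...) encode constraints on the component covariances \(\boldsymbol{\Sigma}_k\).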

What if we run Latent Profile Analysis with all 11 indicators?

X <- as.matrix(df2[,-1])  # drop the tract ID column
mod <- Mclust(X)          # fits all covariance models over 1-9 components by default

We could use both BIC and ICL. Let's look at the BIC first:

summary(mod$BIC)
Best BIC values:
             VVE,9        EVE,8        EVE,7
BIC      -60838.39 -60858.68524 -60929.97097
BIC diff      0.00    -20.29489    -91.58063
plot(mod, what = "BIC", ylim = range(mod$BIC[,-(1:2)], na.rm = TRUE),
       legendArgs = list(x = "bottomleft"))

The picture is much the same if we use the integrated complete-data likelihood (ICL) criterion instead:

ICL <- mclustICL(X)
plot(ICL)

Running Mclust on all 11 indicators, the data-driven selection produces as many as 8 or 9 clusters.

# tabulate class-membership numbers
table(summary(mod)$classification)

  1   2   3   4   5   6   7   8   9 
 68 137  89  38  52  98  55  79 181 
drmod <- MclustDR(mod, lambda = 1)
plot(drmod, what = "contour")

Why does this happen?

# display the means per class
rmarkdown::paged_table(as.data.frame(round(mod$parameters$mean,2)), 
                       options = list(rows.print = 11, cols.print = 9))
par(mar = c(4, 4, .1, .1))
plot(density(df2$Unemployment))
plot(density(df2$PctBelowPoverty))

The density plots reveal that indicators such as Unemployment and PctBelowPoverty are heavily skewed. A Gaussian mixture compensates for such non-normality by adding components, so the selected number of clusters gets inflated. Thus, with the homogeneity and separation concepts in mind:

\(\Rightarrow\) We want to increase homogeneity. For this purpose, we use factor analysis and keep the variables that load on the first factor.

Factor Analysis

Determine Number of Factors to Extract

ev <- eigen(cor(df.fa[,-1])) # get eigenvalues
ap <- parallel(subject=nrow(df.fa[,-1]), var=ncol(df.fa[,-1]), rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS) 
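
As a cross-check (an aside, not part of the original workflow), the psych package wraps a similar parallel analysis in one call; a minimal sketch, assuming psych is installed:

library(psych)
# compares observed eigenvalues with those from random/resampled data
fa.parallel(df.fa[,-1], fm = "ml", fa = "fa")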

Running EFA

When we run a factor analysis, we need to decide on three things:

  1. the number of factors
  2. the method of estimation
  3. the rotation

\(\Rightarrow\) We run Maximum Likelihood factor analysis (item 2), entering the raw data and extracting 3 factors (item 1), with varimax rotation (item 3):

fit.f <- factanal(x = df.fa[,-1], factors = 3, n.obs = 797, rotation="varimax")
print(fit.f, digits=2, cutoff=.3, sort=TRUE)

Call:
factanal(x = df.fa[, -1], factors = 3, n.obs = 797, rotation = "varimax")

Uniquenesses:
  AgeDependencyRatio LimitedEngProf5andUP           LessThanHS 
                0.37                 0.00                 0.19 
        Unemployment       PctForeignBorn          FemaleHHPct 
                0.29                 0.10                 0.18 
    MedianIncomeHH1K        AllOcc2009Pct        PctVacHousing 
                0.29                 0.21                 0.60 
     PctBelowPoverty  PublicAssistancePct 
                0.12                 0.07 

Loadings:
                     Factor1 Factor2 Factor3
AgeDependencyRatio    0.56            0.56  
Unemployment          0.80                  
FemaleHHPct           0.84                  
MedianIncomeHH1K     -0.80                  
PctVacHousing         0.61                  
PctBelowPoverty       0.93                  
PublicAssistancePct   0.96                  
LimitedEngProf5andUP          0.99          
LessThanHS            0.53    0.69          
PctForeignBorn                0.91          
AllOcc2009Pct                         0.89  

               Factor1 Factor2 Factor3
SS loadings       4.82    2.43    1.33
Proportion Var    0.44    0.22    0.12
Cumulative Var    0.44    0.66    0.78

Test of the hypothesis that 3 factors are sufficient.
The chi square statistic is 222.31 on 25 degrees of freedom.
The p-value is 1.46e-33 

Applying factor scores

We calculate and show the FA scores, i.e., the concentrated disadvantage index scores:

# score = weighted sum of the indicators whose |loading| on factor 1 exceeds 0.5
df.fa$concdisadv <- as.matrix(df.fa[,-1]) %*% (fit.f$loadings[,1]*(abs(fit.f$loadings[,1])>0.5))
# print out the scores
rmarkdown::paged_table(cbind(tract = df.fa[1:20,1], 
                             condisadv_fascore = df.fa[1:20,13], 
                             df.fa[1:20,2:12]), 
                       options = list(rows.print = 20, cols.print = 13))
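
As an aside, factanal can also return model-based factor scores directly through its scores argument ("regression" or "Bartlett"); a minimal sketch on the 11 indicators (columns 2-12, since concdisadv is now the 13th column):

fit.s <- factanal(x = df.fa[,2:12], factors = 3, rotation = "varimax",
                  scores = "regression")
head(fit.s$scores)  # one column per factor; Factor1 tracks concentrated disadvantage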

Latent Profile Analysis

Data to run LPA

The latent profile analysis (normal mixture) will use the variables that load on factor 1, as identified by the factor analysis:

1- AgeDependencyRatio
2- Unemployment
3- FemaleHHPct
4- MedianIncomeHH1K
5- PctVacHousing
6- PctBelowPoverty
7- PublicAssistancePct
8- LessThanHS

# keep the variables loading on factor 1 (plus the tract ID)
df3 <- df2 %>% 
  select(tract, AgeDependencyRatio, Unemployment, FemaleHHPct, 
         MedianIncomeHH1K, PctVacHousing, PctBelowPoverty,  
         PublicAssistancePct, LessThanHS
         )

How many clusters should we create?

X <- as.matrix(df3[,-1])
ICL <- mclustICL(X)
summary(ICL)
Best ICL values:
             VVV,3        VVV,4        VVE,7
ICL      -45689.17 -45706.83796 -45709.47096
ICL diff      0.00    -17.66358    -20.29659
plot(ICL)

modBIC <- Mclust(X)
summary(modBIC$BIC)
Best BIC values:
            VVE,7        VVE,6        VVV,4
BIC      -45583.8 -45597.13834 -45603.11491
BIC diff      0.0    -13.33485    -19.31142
plot(modBIC, what="BIC", ylim=range(modBIC$BIC[,-(1:2)], na.rm=TRUE), 
     legendArgs = list(x = "bottomright"))

drmod <- MclustDR(modBIC, lambda = 1)
plot(drmod, what = "contour")

mod.3 <- Mclust(X, G = 3)
drmod.3 <- MclustDR(mod.3, lambda = 1)
plot(drmod.3, what = "contour")

We fit a 3-component GMM with unconstrained covariance matrices (VVV), the top ICL choice:

mod.3 <- Mclust(df2[,-1], G=3, modelNames = "VVV") 
summary(mod.3) 
---------------------------------------------------- 
Gaussian finite mixture model fitted by EM algorithm 
---------------------------------------------------- 

Mclust VVV (ellipsoidal, varying volume, shape, and orientation)
model with 3 components: 

 log-likelihood   n  df       BIC       ICL
      -29837.77 797 233 -61232.17 -61266.61

Clustering table:
  1   2   3 
189 329 279 

Mixture probabilities and means (SDs in a separate table below) for the census tract characteristics:

sum.mod.3 <- summary(mod.3, parameters = TRUE)
rbind(Component = c("Class 1","Class 2","Class 3"),
      `Mix Probability` = paste0(round((sum.mod.3$pro)*100,1),"%"),
      round(sum.mod.3$mean,1),
      Labels = c("Affluent","Poor","Distressed")
)
                     [,1]       [,2]      [,3]        
Component            "Class 1"  "Class 2" "Class 3"   
Mix Probability      "23.7%"    "41.3%"   "35%"       
AgeDependencyRatio   "35.9"     "50.3"    "67.5"      
LimitedEngProf5andUP "6.6"      "27.3"    "3.3"       
LessThanHS           "5.3"      "24.6"    "19.9"      
Unemployment         "5.3"      "11.3"    "24.6"      
PctForeignBorn       "14"       "33.4"    "4.4"       
FemaleHHPct          "5.8"      "15.5"    "35.3"      
MedianIncomeHH1K     "87.1"     "44.9"    "29.9"      
AllOcc2009Pct        "56"       "64.1"    "66"        
PctVacHousing        "9"        "11"      "20.6"      
PctBelowPoverty      "9.3"      "22.5"    "35.9"      
PublicAssistancePct  "5"        "21"      "41.6"      
Labels               "Affluent" "Poor"    "Distressed"
sd.class1 <- round(sqrt(diag(sum.mod.3$variance[,,1])),1)
sd.class2 <- round(sqrt(diag(sum.mod.3$variance[,,2])),1)
sd.class3 <- round(sqrt(diag(sum.mod.3$variance[,,3])),1)
cbind(sd.class1,sd.class2,sd.class3)
                     sd.class1 sd.class2 sd.class3
AgeDependencyRatio        18.6      14.1      15.9
LimitedEngProf5andUP       5.1      12.6       5.2
LessThanHS                 4.4      14.1       8.3
Unemployment               2.6       4.4       9.8
PctForeignBorn             7.2      11.9       5.8
FemaleHHPct                4.0       6.6      11.3
MedianIncomeHH1K          22.9      12.5      11.0
AllOcc2009Pct             16.0      11.3      12.5
PctVacHousing              6.5       4.9       9.7
PctBelowPoverty            5.1       9.6      14.0
PublicAssistancePct        4.1       9.9      14.0

Applying the clustering to the Chicago neighborhood map

Cluster membership by class number

df2$ConcDisadv_cluster <- mod.3$classification # becomes the 13th column of df2
with(df2, table(ConcDisadv_cluster))
ConcDisadv_cluster
  1   2   3 
189 329 279 

Probability for each tract in each class

df2[,14:16] <- mod.3$z
names(df2)[14:16] <- c("ConcDisadv_prob1", 
                       "ConcDisadv_prob2", 
                       "ConcDisadv_prob3")

Concentrated disadvantage cluster assignment, based on the highest posterior probability:

rmarkdown::paged_table(cbind(tract = df2[1:20,1], 
                             `Concentrated Disadvantage` = df2[1:20,13],
                             df2[1:20,14:16],
                             df2[1:20,2:12]),
                       options = list(rows.print = 20, cols.print = 13))

With these class levels, we can draw the clusters on the Chicago neighborhood map (using ArcGIS), as shown below.
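
For readers without ArcGIS, here is a minimal all-R sketch assuming the tigris, sf, and ggplot2 packages, and assuming that tract corresponds to the numeric 6-digit tract code (TRACTCE) within Cook County; verify both assumptions against your data before relying on the map.

library(tigris)   # TIGER/Line shapefiles (assumed installed)
library(sf)
library(ggplot2)
options(tigris_use_cache = TRUE)
# Cook County census tract boundaries for the ACS vintage used above
cook <- tracts(state = "IL", county = "Cook", year = 2014, class = "sf")
cook$tract <- as.numeric(cook$TRACTCE)  # assumption: matches df2$tract coding
map.df <- merge(cook, df2[, c("tract", "ConcDisadv_cluster")], by = "tract")
ggplot(map.df) +
  geom_sf(aes(fill = factor(ConcDisadv_cluster)), color = NA) +
  scale_fill_discrete(name = "Class",
                      labels = c("1: Affluent", "2: Poor", "3: Distressed")) +
  theme_void()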

Classification by race-ethnicity

r_eth.df <- read.csv("data/Chicago Census Data by Tract_ACS 2015.csv", header = TRUE) %>% 
  select(tract, PctBlack,PctAsian,PctHispanicLatino,PctWhiteNonHisp)
dff <- merge(df2, r_eth.df, by="tract")
mclust1Dplot(dff[,"PctBlack"], parameters = mod.3$parameters, z = mod.3$z, 
             what = "classification", main = FALSE)
title("Percent Black")

mclust1Dplot(dff[,"PctAsian"], parameters = mod.3$parameters, z = mod.3$z, 
             what = "classification", main = FALSE)
title("Percent Asian")

mclust1Dplot(dff[,"PctHispanicLatino"], parameters = mod.3$parameters, z = mod.3$z, 
             what = "classification", main = FALSE)
title("Percent Hispanic Latino")

mclust1Dplot(dff[,"PctWhiteNonHisp"], parameters = mod.3$parameters, z = mod.3$z, 
             what = "classification", main = FALSE)
title("Percent White")

We can see that neighborhoods with a high percent Black population concentrate in class 3 (the first graph). The opposite holds for neighborhoods with a high percent non-Hispanic White population, which concentrate in class 1 (the last graph).

Now, how can we turn the visualization into an analysis? I will discuss this in a future post.
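
In the meantime, one natural option (my own sketch, not necessarily the author's forthcoming analysis) is a multinomial logit of the modal class on tract composition, e.g. with nnet::multinom; note that this "classify then analyze" shortcut ignores the classification uncertainty:

library(nnet)  # multinom() for multinomial logistic regression
dff$class <- factor(dff$ConcDisadv_cluster,
                    labels = c("Affluent", "Poor", "Distressed"))
# PctWhiteNonHisp left out as the implicit reference composition
fit.mn <- multinom(class ~ PctBlack + PctAsian + PctHispanicLatino, data = dff)
summary(fit.mn)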

Further readings

Pugach, Oksana. 2018. "Latent Profile Analysis of Chicago Neighborhoods." Unpublished work. Chalk Talk, Institute for Health Research and Policy - Methodology Research Core.

University of Virginia Library, Research Data Services + Sciences. “Getting Started with Factor Analysis” - https://data.library.virginia.edu/getting-started-with-factor-analysis/

Stats Stackexchange. “How to calculate the loading matrix from the score matrix and a data matrix X (PCA)?” - https://stats.stackexchange.com/questions/447952/how-to-calculate-the-loading-matrix-from-the-score-matrix-and-a-data-matrix-x-p

Luca Scrucca. 2020. “A quick tour of mclust” - https://cran.r-project.org/web/packages/mclust/vignettes/mclust.html

Luca Scrucca, Michael Fop, T. Brendan Murphy, Adrian E. Raftery. 2016. "mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models." The R Journal 8(1): 289-317.


Citation

For attribution, please cite this work as

Nguyen (2021, Oct. 24). HaiBiostat: What could we do if Latent Profile Analysis (LPA) does not run as expected? -- LPA of Chicago Neighborhoods. Retrieved from https://hai-mn.github.io/posts/2021-10-24-LPA Chicago Neighborhoods/

BibTeX citation

@misc{nguyen2021what,
  author = {Nguyen, Hai},
title = {HaiBiostat: What could we do if Latent Profile Analysis (LPA) does not run as expected? -- LPA of Chicago Neighborhoods},
  url = {https://hai-mn.github.io/posts/2021-10-24-LPA Chicago Neighborhoods/},
  year = {2021}
}